Web Analytics From Scratch

By Sam Reynoso

Here is a fun little project I built in an afternoon. This is how I prototype.

What Not to Build

As a self-taught programmer, I've always had to steer my own learning. That meant building everything myself — "Not Invented Here" wasn’t a syndrome, it was a feature. But now that I’m pushing a product out the door, I’ve had to do a complete 540 on that mindset.

This site gives me a release valve — a place to build without a filter. I don’t have to ask, “Does this add value?” or “Is this core to the product?” I just build. That said, this is still a blog, and I'd like to know if people are actually reading it.

So I set aside an hour to slap together a simple analytics system. Naturally, I blew past that time box by ~100%.

The Data Model

This is a blog. The main things I care about are page view duration and scroll depth. The content is long and vertical, so bounce rate by scroll depth feels like a decent KPI... right? I also want the basics: unique visitor counts, daily traffic, and revisit frequency.

I settled on this as a starting point:



  // File: ./analytics/data/[date]/[client IP].json
  // One record shown; each file holds a list of these.

  {
      "key": "17887",
      "position": 0,
      "timestamp": 1752269843077,
      "url": "http://localhost/"
  }

No database. Just raw JSON to disk. The file system can handle the volume, and I don’t care about real-time querying. But yeah, I’m locking myself into a pretty specific traversal pattern. If I wanted to know each time a specific user visited a specific URL, I’d have to scan every date dir, match the IP, and read each file line by line. But I probably won’t ever want that. (Famous last words.)
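To make that traversal cost concrete, here is a rough sketch of the lookup. It assumes each file holds a JSON list of records shaped like the sample above, stored under `[date]/[client IP].json`. The name `visits_for` and its parameters are made up for illustration and are not part of the real project.

```python
import json
from pathlib import Path


def visits_for(root, client_ip, url):
    """Return (date, timestamp) pairs for every record of client_ip on url.

    Assumes root/<date>/<client_ip>.json holds a JSON list of records.
    """
    hits = []
    # Every date directory has to be visited; there is no index to consult.
    for date_dir in sorted(Path(root).iterdir()):
        path = date_dir / f"{client_ip}.json"
        if not path.is_file():
            continue
        with path.open() as f:
            records = json.load(f)
        hits.extend(
            (date_dir.name, rec["timestamp"])
            for rec in records
            if rec.get("url") == url
        )
    return hits
```

The cost grows with the number of date directories, not with the number of matches, which is exactly the lock-in described above.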

The point is: I’m capturing the data. I can always migrate it to a database later... which I won’t.
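As a sketch of what those fields allow, view duration and deepest scroll per URL can be derived from one client's records. The assumptions here: `timestamp` is in milliseconds (what `Date.now()` produces), `position` is `window.scrollY` in pixels, and `summarize` is a hypothetical helper name rather than anything in the project.

```python
from collections import defaultdict


def summarize(records):
    """Map each URL to its view duration (seconds) and deepest scroll (px)."""
    by_url = defaultdict(list)
    for rec in records:
        by_url[rec["url"]].append(rec)

    summary = {}
    for url, recs in by_url.items():
        times = [r["timestamp"] for r in recs]
        summary[url] = {
            # First-to-last sample; gaps between separate visits are included.
            "duration_s": (max(times) - min(times)) / 1000,
            "max_scroll_px": max(r["position"] for r in recs),
        }
    return summary
```

One limitation worth noting: the record stores a pixel offset but not the page height, so depth as a percentage of the page is not recoverable from this data alone.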

Client Code

I figured the client-side code would only take a few lines, and I was mostly right. Instead of firing updates on every scroll event, I went with a request/response model: the server sends down a key, and the client responds with its scroll position. This normalizes the update frequency, so reading, seeking, and skimming all produce the same number of updates per unit time. I traded fidelity for simplicity in how behaviors get bucketed.



    ws.onmessage = function (event) {
      sendMessage(event.data);
    };

    function sendMessage(key) {

      const message = {
        key: key,
        position: window.scrollY,
        timestamp: Date.now(),
        url: window.location.href
      };

      ws.send(JSON.stringify(message));
    }

That's all there is to it. The idea behind the key is that the client echoes it back as part of each update, or the update is ignored. I'm hoping that's enough to keep spam out of the data. Other than that, the client just sends its scroll position whenever it hears from the server.
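The server-side half of that check could look something like the sketch below. It is not the code that runs on this site: it swaps in `secrets` and `hmac.compare_digest` from the standard library in place of a five-digit random number, and `new_key` and `accept_update` are illustrative names only.

```python
import hmac
import json
import secrets


def new_key():
    """Generate an unguessable per-client key (32 hex characters)."""
    return secrets.token_hex(16)


def accept_update(raw, expected_key):
    """Return the parsed update if it echoes the expected key, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    key = data.get("key")
    if not isinstance(key, str):
        return None
    # Constant-time comparison avoids leaking how much of the key matched.
    if not hmac.compare_digest(key.encode(), expected_key.encode()):
        return None
    return data
```

Since the key travels over the same socket, any client that connects and listens will receive it. At best this filters out senders that never read what the server sends; it is not authentication.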

I Have the Data, Now What?

The server code is pretty boring. It's at the bottom of the page if you want to scroll down and look at it. I'll know if you do.

Now that I have data ingress, I need to think about what to do with it and how to present it. But wait, no I don't. This project serves no practical purpose other than as content for my blog. Let's not get ahead of ourselves. I'll just write a few matplotlib scripts and call it a day. And that's exactly what I did.



  import json

  # ScrollHist and lib are project helpers that are not shown here.

  def main():
      hist = ScrollHist()
      for client_ip in lib.client_ips():
          # lib.client_files() is taken to yield each file's contents as a JSON string.
          for file in lib.client_files(client_ip):
              # Each file holds a list of scroll records.
              for datum in json.loads(file):
                  hist.add_datum(datum)
      return hist

  if __name__ == "__main__":
      hist = main()
      hist.plot()

Again, the code is super boring. I'll spare you the helper functions and the plotting code. Also, I vibe coded the plot code, and it doesn't show a heatmap like I asked. The variable names in the example code have been changed to protect the innocent.
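For a rough idea of what sits behind something like `ScrollHist`, here is a minimal stand-in that buckets scroll positions. It is not the real class, just a sketch of the idea; `ScrollBuckets`, `bucket_px`, and `rows` are invented names, and the plotting is left out so the block runs on the standard library alone.

```python
from collections import Counter


class ScrollBuckets:
    """Count scroll samples per fixed-height bucket of the page."""

    def __init__(self, bucket_px=500):
        self.bucket_px = bucket_px
        self.counts = Counter()

    def add_datum(self, datum):
        # Integer-divide the pixel offset to find the bucket index.
        self.counts[int(datum["position"]) // self.bucket_px] += 1

    def rows(self):
        """Return (bucket_start_px, sample_count) pairs in page order."""
        return [(b * self.bucket_px, n) for b, n in sorted(self.counts.items())]
```

Because the server asks for a position about once per second, each count is roughly the number of seconds spent in that bucket. The pairs from `rows()` could then be handed to something like matplotlib's `plt.bar`.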

Figure: Scroll Histogram

Conclusion

I've wanted to do this project for a while. It's nothing magical. I might deploy it, or I might use a third-party solution to remove any temptation to add more features. I'm leaning very heavily towards "Do not build this yourself."

Server Code



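  import asyncio
  import json
  import os
  import random
  import time

  from fastapi import FastAPI, WebSocket, WebSocketDisconnect

  app = FastAPI()

  # In-memory cache of per-client state, keyed by client IP.
  DATA = {}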


  def get_client_ip(websocket: WebSocket):
      x_forwarded_for = websocket.headers.get("x-forwarded-for")
      if x_forwarded_for:
          return x_forwarded_for.split(",")[0].strip()
      if websocket.client:
          return websocket.client[0]
      return "unknown"


  def get_client_data(date: str, client_ip: str):
      client_data = load_client_history(date, client_ip)
      if client_ip not in DATA:
          secret_key = str(random.randint(10000, 99999))
          DATA[client_ip] = {
              "secret_key": secret_key,
              "history": client_data
          }
      return DATA[client_ip]


  def load_client_history(date: str, client_ip: str):
      file_path = f"analytics/data/{date}/{client_ip}.json"
      if os.path.exists(file_path):
          with open(file_path, "r") as f:
              return json.load(f)
      return []


  def write_history(date: str, client_ip: str):
      date = time.strftime("%Y-%m-%d", time.localtime())
      client_data = get_client_data(date, client_ip)
      create_directory_structure(date)
      with open(f"analytics/data/{date}/{client_ip}.json", "w") as f:
          json.dump(client_data["history"], f, indent=4)
      print(f"[+] Data written for {client_ip} on {date}")


  def create_directory_structure(date: str):
      # Create the same directory that write_history() writes into.
      os.makedirs(f"analytics/data/{date}", exist_ok=True)


  @app.websocket("/ws")
  async def websocket_endpoint(websocket: WebSocket):
      await websocket.accept()

      date = time.strftime("%Y-%m-%d", time.localtime())
      client_ip = get_client_ip(websocket)
      client_data = get_client_data(date, client_ip)

      try:
          secret_key = client_data['secret_key']
          assert isinstance(client_data['secret_key'], str), "Secret key must be a string."
      except KeyError:
          print("[-] No secret key found in data.")
          await websocket.close(code=1008, reason="No secret key found.")
          return
      except AssertionError as e:
          print(f"[-] Assertion error: {e}")
          await websocket.close(code=1008, reason="Invalid secret key format.")
          return

      try:
          while True:
              # Non-blocking pause (needs `import asyncio`); time.sleep() would stall the event loop.
              await asyncio.sleep(1)
              await websocket.send_text(secret_key)
              rec = await websocket.receive_text()

              print(f"[+] Received message: {rec} from {get_client_ip(websocket)}")

              try:
                  data = json.loads(rec)
                  assert isinstance(data, dict), "Position must be a dictionary."
              except json.JSONDecodeError:
                  print(f"[-] Invalid JSON received: {rec}")
                  continue
              except AssertionError as e:
                  print(f"[-] Assertion error: {e}")
                  continue

              if data.get("key") != secret_key:
                  print(data)
                  print(f"[-] Invalid secret key in position data: {data.get('key')}")
                  continue

              try:
                  client_data["history"].append(data)
              except KeyError:
                  print(f"[-] Error appending history: {data}")
                  continue

      except WebSocketDisconnect:
          print("[-] Disconnected")

      write_history(date, client_ip)